Leverage incremental output between the inference and async engines to improve performance #4054

lvhan028 · 2025-10-20T13:32:54Z

Motivation

The current transport protocol between the async_engine and the inference engine causes a performance degradation of more than 5% when logprobs are requested. The protocol transmits the entire cumulative sequence of generated tokens in each iteration, which results in redundant data transfer and added processing latency.

Modification

To eliminate this redundancy, the protocol has been modified to transmit only the newly generated tokens and their associated metadata (e.g., logprobs) in each iteration.
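The difference between the two transport styles can be illustrated with a minimal sketch. This is not lmdeploy's actual code: `StepOutput`, `cumulative_stream`, `incremental_stream`, and `receive_incremental` are hypothetical names chosen for illustration, and the token ids and logprobs are made up. It only shows why resending the full sequence costs more, and how a receiver could rebuild the full output from deltas.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List


@dataclass
class StepOutput:
    """Hypothetical per-iteration message (not lmdeploy's real EngineOutput)."""
    token_ids: List[int]
    logprobs: List[float] = field(default_factory=list)


def cumulative_stream(tokens: List[int], logprobs: List[float]) -> Iterator[StepOutput]:
    """Cumulative style: every step resends the whole sequence generated so far."""
    for i in range(1, len(tokens) + 1):
        yield StepOutput(tokens[:i], logprobs[:i])


def incremental_stream(tokens: List[int], logprobs: List[float]) -> Iterator[StepOutput]:
    """Incremental style: every step sends only the newly generated token."""
    for i in range(len(tokens)):
        yield StepOutput(tokens[i:i + 1], logprobs[i:i + 1])


def receive_incremental(stream: Iterable[StepOutput]) -> StepOutput:
    """Receiver side: rebuild the full sequence by appending each delta."""
    full = StepOutput([], [])
    for step in stream:
        full.token_ids.extend(step.token_ids)
        full.logprobs.extend(step.logprobs)
    return full


def transferred_tokens(stream: Iterable[StepOutput]) -> int:
    """Count how many token entries cross the engine boundary in total."""
    return sum(len(step.token_ids) for step in stream)


# Made-up data: 8 generated tokens with one logprob each.
tokens = list(range(100, 108))
logprobs = [-0.1 * (i + 1) for i in range(8)]

cumulative_cost = transferred_tokens(cumulative_stream(tokens, logprobs))    # 1 + 2 + ... + 8
incremental_cost = transferred_tokens(incremental_stream(tokens, logprobs))  # one entry per step
rebuilt = receive_incremental(incremental_stream(tokens, logprobs))

print(cumulative_cost, incremental_cost, rebuilt.token_ids == tokens)
```

In this sketch the cumulative style moves n(n+1)/2 token entries for n generated tokens, while the incremental style moves n. Logprobs and other per-token metadata travel with each token, so they make the gap larger.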

Benchmark on H800

Serve a model with the PyTorch engine:

lmdeploy serve api_server Qwen/Qwen3-8B --backend pytorch --logprobs-mode raw_logprobs --enable-metrics

Benchmark the /generate endpoint using the script at https://gist.github.com/irexyc/add84faadbfdc229f28c7da3cf0d3ce8:

python profile_restful_api.py --backend lmdeploy --dataset-path /nvme1/shared/ShareGPT_V3_unfiltered_cleaned_split.json --dataset-name random --random-input-len 170 --random-output-len 2048 --random-range-ratio 0.9 --num-prompts 1024

Before:

============ Serving Benchmark Result ============
Backend:                                 lmdeploy  
Traffic request rate:                    inf       
Successful requests:                     1024      
Benchmark duration (s):                  319.86    
Total input tokens:                      165130    
Total generated tokens:                  1992686   
Total generated tokens (retokenized):    0         
Request throughput (req/s):              3.20      
Input token throughput (tok/s):          516.26    
Output token throughput (tok/s):         6229.89   
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   214800.21 
Median E2E Latency (ms):                 220168.95 
---------------Time to First Token----------------
Mean TTFT (ms):                          2856.35   
Median TTFT (ms):                        2831.78   
P99 TTFT (ms):                           4512.24   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          109.07    
Median TPOT (ms):                        112.41    
P99 TPOT (ms):                           165.81    
---------------Inter-token Latency----------------
Mean ITL (ms):                           942.11    
Median ITL (ms):                         380.80    
P99 ITL (ms):                            1191.39   
==================================================

After:

============ Serving Benchmark Result ============
Backend:                                 lmdeploy  
Traffic request rate:                    inf       
Successful requests:                     1024      
Benchmark duration (s):                  305.74    
Total input tokens:                      165130    
Total generated tokens:                  1992686   
Total generated tokens (retokenized):    0         
Request throughput (req/s):              3.35      
Input token throughput (tok/s):          540.10    
Output token throughput (tok/s):         6517.59   
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   204367.81 
Median E2E Latency (ms):                 209678.84 
---------------Time to First Token----------------
Mean TTFT (ms):                          2798.76   
Median TTFT (ms):                        2643.68   
P99 TTFT (ms):                           4446.10   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          103.68    
Median TPOT (ms):                        106.52    
P99 TPOT (ms):                           158.08    
---------------Inter-token Latency----------------
Mean ITL (ms):                           560.02    
Median ITL (ms):                         225.71    
P99 ITL (ms):                            766.30    
==================================================
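To put the two runs side by side, the relative changes can be computed from the figures reported above. The snippet below is only a small helper for that arithmetic; the dictionary keys are labels chosen here, and the numbers are copied from the "Before" and "After" tables.

```python
# Figures copied from the "Before" and "After" benchmark tables above.
before = {
    "benchmark_duration_s": 319.86,
    "output_throughput_tok_s": 6229.89,
    "mean_tpot_ms": 109.07,
    "mean_itl_ms": 942.11,
}
after = {
    "benchmark_duration_s": 305.74,
    "output_throughput_tok_s": 6517.59,
    "mean_tpot_ms": 103.68,
    "mean_itl_ms": 560.02,
}


def relative_change(old: float, new: float) -> float:
    """Signed percentage change from old to new."""
    return (new - old) / old * 100.0


changes = {name: relative_change(before[name], after[name]) for name in before}

for name, pct in changes.items():
    print(f"{name}: {pct:+.1f}%")
```

On these numbers, output token throughput rises by about 4.6%, benchmark duration drops by about 4.4%, mean TPOT drops by about 4.9%, and mean ITL drops by about 41%.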

lvhan028 and others added 8 commits October 20, 2025 10:54

Leverage incremental output between the inference and async engines to improve performance

bce1adc

fixj

72066c0

fix

df8e547

improve

2d774f2

Merge branch 'main' into redefine-protocol

d867372

use incremental output for pt engine

7faf706

remove check if isinstance(result, EngineOutput)

d0c961d

Merge pull request #5 from irexyc/redefine-protocol-pt

6126b23

use incremental output for pt engine

lvhan028 requested review from grimoire and lzhangzz October 22, 2025 03:22

lvhan028 added the improvement label Oct 22, 2025

lvhan028 mentioned this pull request Oct 22, 2025

incrementally send / recv EngineOutput in ray mp engine #4048

Closed

lvhan028 added 2 commits October 22, 2025 12:20

Merge branch 'redefine-protocol' of https://github.com/lvhan028/lmdeploy into redefine-protocol

e4b9014

update

6b5f3b7

grimoire approved these changes Oct 22, 2025

lvhan028 mentioned this pull request Oct 22, 2025

bump version to v0.10.2 #4062

Merged

3 tasks

lvhan028 added 2 commits October 23, 2025 10:59

fix ut

1dc01dc

update comments

ac65cf7

lvhan028 force-pushed the redefine-protocol branch from 325e027 to ac65cf7 October 23, 2025 03:32

fix metrics

756b7fb

lzhangzz approved these changes Oct 23, 2025

lvhan028 merged commit 4af69f2 into InternLM:main Oct 23, 2025
5 checks passed
